Abstract

The standard loss functions used in the literature on probabilistic
prediction are the log loss function, the Brier loss function, and the spherical
loss function; however, any computable proper loss function can be used
for comparison of prediction algorithms. This note shows that the log loss
function is most selective in that any prediction algorithm that is
optimal for a given data sequence (in the sense of the algorithmic theory of
randomness) under the log loss function will be optimal under any
computable proper mixable loss function; on the other hand, there is a data
sequence and a prediction algorithm that is optimal for that sequence
under either of the two other standard loss functions but not under the log
loss function.


1 Introduction 

In the empirical work on probabilistic prediction in machine learning (see, e.g., 
[2]) the most standard loss functions are log loss and Brier loss, and spherical loss 
is a viable alternative; all these loss functions will be defined later in this note. It 
is important to understand which of these three loss functions is likely to lead to 
better prediction algorithms. We formalize this question using a generalization 
of the notion of Kolmogorov complexity called predictive complexity (see, e.g., 
[6]; it is defined in Section 3). Our answer is that the log loss function is likely 
to lead to better prediction algorithms as it is more selective: if a prediction 
algorithm is optimal under the log loss function, it will be optimal under the 
Brier and spherical loss functions, but the opposite implications are not true in 
general. 

As we discuss at the end of Section 3, the log loss function corresponds to the 
classical theory of randomness. Therefore, our findings confirm once again the 
importance of the classical theory and are not surprising at all from the point 
of view of that theory. But from the point of view of experimental machine 
learning, our recommendation to use the log loss function rather than Brier or 
spherical is less trivial. 

This note is, of course, not the first to argue that the log loss function is
fundamental. For example, David Dowe has argued for it since at least 2008
([3], footnote 175; see [4], Section 4.1, for further references). Another paper
supporting the use of the log loss function is Bickel’s [1].

2 Loss Functions 

We are interested in the problem of binary probabilistic prediction: the task is
to predict a binary label $y \in \{0,1\}$ with a number $p \in [0,1]$; intuitively, $p$ is the
predicted probability that $y = 1$. The quality of the prediction $p$ is measured
by a loss function $\lambda : [0,1] \times \{0,1\} \to \mathbb{R} \cup \{+\infty\}$. Intuitively, $\lambda(p, y)$ is the loss
suffered by a prediction algorithm that outputs a prediction $p$ while the actual
label is $y$; the value $+\infty$ (from now on abbreviated to $\infty$) is allowed. Following
[9], we will write $\lambda_y(p)$ in place of $\lambda(p,y)$, and so identify $\lambda$ with the pair of
functions $(\lambda_0, \lambda_1)$ where $\lambda_0 : [0,1] \to \mathbb{R} \cup \{+\infty\}$ and $\lambda_1 : [0,1] \to \mathbb{R} \cup \{+\infty\}$.
We will assume that $\lambda_0(0) = \lambda_1(1) = 0$, that the function $\lambda_0$ is increasing, that
the function $\lambda_1$ is decreasing, and that $\lambda_y(p) < \infty$ unless $p \in \{0,1\}$.

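A loss function $\lambda$ is called $\eta$-mixable for $\eta \in (0, \infty)$ if the set

$$\left\{ (u,v) \in [0,1]^2 \mid \exists p \in [0,1] : u \le e^{-\eta\lambda(p,0)} \text{ and } v \le e^{-\eta\lambda(p,1)} \right\}$$

is convex; we say that $\lambda$ is mixable if it is $\eta$-mixable for some $\eta$.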

A loss function $\lambda$ is called proper if, for all $p, q \in [0,1]$,

$$\mathbb{E}_p \lambda(p, \cdot) \le \mathbb{E}_p \lambda(q, \cdot), \tag{1}$$

where $\mathbb{E}_p f := p f(1) + (1-p) f(0)$ for $f : \{0,1\} \to \mathbb{R}$. It is strictly proper if the
inequality in (1) is strict whenever $q \ne p$.

We will only be interested in computable loss functions (the notion of
computability is not defined formally in this note; see, e.g., [6]). We will refer to
the loss functions satisfying the properties listed above as CPM (computable
proper mixable) loss functions.

Besides, we will sometimes make the following smoothness assumptions:

• $\lambda_0$ is infinitely differentiable over the interval $[0,1)$ (the derivatives at 0
being one-sided);

• $\lambda_1$ is infinitely differentiable over the interval $(0,1]$ (the derivatives at 1
being one-sided);

• for all $p \in (0,1)$, $(\lambda_0'(p), \lambda_1'(p)) \ne 0$.

We will refer to the loss functions satisfying all the properties listed above as 
CPMS (computable proper mixable smooth) loss functions. 

Examples 

The most popular loss functions in machine learning are the log loss function

$$\lambda_1(p) := -\ln p, \qquad \lambda_0(p) := -\ln(1-p)$$

and the Brier loss function

$$\lambda(p,y) := (y-p)^2.$$

Somewhat less popular is the spherical loss function

$$\lambda_1(p) := 1 - \frac{p}{\sqrt{p^2 + (1-p)^2}}, \qquad \lambda_0(p) := 1 - \frac{1-p}{\sqrt{p^2 + (1-p)^2}}.$$

All three loss functions are mixable, as we will see later. They are also
computable (obviously), strictly proper (this can be checked by differentiation), and
satisfy the smoothness conditions (obviously). Being computable and strictly
proper, these loss functions can be used to measure the quality of probabilistic
predictions.
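The three definitions and the propriety condition (1) can be checked numerically. The following is a minimal sketch, not part of the original note: the function names and the grid-based check `is_proper_on_grid` are illustrative assumptions, and a finite grid can only suggest propriety, not prove it.

```python
import math

def log_loss(p, y):
    """Log loss: -ln p if y == 1, -ln(1 - p) if y == 0 (may be +inf)."""
    q = p if y == 1 else 1.0 - p
    return math.inf if q == 0.0 else -math.log(q)

def brier_loss(p, y):
    """Brier loss: (y - p)^2."""
    return (y - p) ** 2

def spherical_loss(p, y):
    """Spherical loss: 1 - q / sqrt(p^2 + (1-p)^2), q = probability given to y."""
    q = p if y == 1 else 1.0 - p
    return 1.0 - q / math.sqrt(p * p + (1.0 - p) ** 2)

def expected_loss(loss, p, q):
    """E_p loss(q, .) = p * loss(q, 1) + (1 - p) * loss(q, 0)."""
    return p * loss(q, 1) + (1.0 - p) * loss(q, 0)

def is_proper_on_grid(loss, n=99):
    """Check inequality (1) on an interior grid (hypothetical helper)."""
    grid = [i / (n + 1) for i in range(1, n + 1)]
    return all(
        expected_loss(loss, p, p) <= expected_loss(loss, p, q) + 1e-12
        for p in grid
        for q in grid
    )
```

For comparison, the absolute loss `abs(y - p)` fails this check, since its expected loss is linear in the prediction and is minimized at 0 or 1.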


Mixability and Propriety 

Intuitively, propriety can be regarded as a way of parameterizing loss functions,
and we get it almost for free for mixable loss functions. The essence of a loss
function is its prediction set

$$\{(\lambda_0(p), \lambda_1(p)) \mid p \in [0,1]\}. \tag{2}$$

When given a prediction set, we can parameterize it by defining $(\lambda_0(p), \lambda_1(p))$
to be the point $(x,y)$ of the prediction set at which $\inf_{(x,y)}(py + (1-p)x)$ is
attained. This will give us a proper loss function. And if the original loss
function satisfies the smoothness conditions (and so, intuitively, the prediction
set does not have corners), the new loss function will be strictly proper.


3 Repetitive Predictions 

Starting from this section we consider the situation, typical in machine learning,
where we repeatedly observe data $z_1, z_2, \ldots$ and each observation $z_t = (x_t, y_t) \in
\mathbf{Z} = \mathbf{X} \times \{0,1\}$ consists of an object $x_t \in \mathbf{X}$ and its label $y_t \in \{0,1\}$. Let us
assume, for simplicity, that $\mathbf{X}$ is a finite set, say a set of natural numbers.

A prediction algorithm is a computable function $F : \mathbf{Z}^* \times \mathbf{X} \to [0,1]$;
intuitively, given a data sequence $\sigma = (z_1, \ldots, z_T)$ and a new object $x$, $F$ outputs
a prediction $F(\sigma, x)$ for the label of $x$. For any data sequence $\sigma = (z_1, \ldots, z_T)$
and loss function $\lambda$, we define the cumulative loss that $F$ suffers on $\sigma$ as

$$\mathrm{Loss}_F^\lambda(\sigma) := \sum_{t=1}^{T} \lambda(F(z_1, \ldots, z_{t-1}, x_t), y_t)$$

(where $z_t = (x_t, y_t)$ and $\infty + a$ is defined to be $\infty$ for any $a \in \mathbb{R} \cup \{\infty\}$).
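The cumulative loss can be computed by replaying the data sequence and feeding each prefix to the prediction algorithm. The sketch below is illustrative only: `laplace_rule` (Laplace's rule of succession, which ignores the objects) is a hypothetical choice of $F$, and the data sequence is represented as a list of `(x, y)` pairs.

```python
import math

def log_loss(p, y):
    """Log loss of prediction p on label y (may be +inf)."""
    q = p if y == 1 else 1.0 - p
    return math.inf if q == 0.0 else -math.log(q)

def laplace_rule(history, x):
    """Hypothetical prediction algorithm F(sigma, x); the object x is ignored."""
    ones = sum(y for _, y in history)
    return (ones + 1) / (len(history) + 2)

def cumulative_loss(F, loss, sigma):
    """Sum over t of loss(F(z_1..z_{t-1}, x_t), y_t) for sigma = [(x_t, y_t), ...]."""
    total = 0.0
    for t, (x_t, y_t) in enumerate(sigma):
        p_t = F(sigma[:t], x_t)      # prediction made before seeing y_t
        total += loss(p_t, y_t)      # float arithmetic gives inf + a == inf
    return total

sigma = [(0, 1), (0, 1)]             # two observations, both labelled 1
total_log_loss = cumulative_loss(laplace_rule, log_loss, sigma)
```

On this two-element sequence the predictions are 1/2 and 2/3, so the cumulative log loss is $\ln 2 + \ln(3/2) = \ln 3$.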
Functions $\mathrm{Loss}_F^\lambda : \mathbf{Z}^* \to \mathbb{R}$ that can be defined this way for a given $\lambda$ are called
loss processes under $\lambda$. In other words, $L : \mathbf{Z}^* \to \mathbb{R}$ is a loss process under $\lambda$ if
and only if $L(\Box) = 0$ (where $\Box$ is the empty sequence) and

$$\forall \sigma \in \mathbf{Z}^* \; \forall x \in \mathbf{X} \; \exists p \in [0,1] \; \forall y \in \{0,1\} : L(\sigma, x, y) = L(\sigma) + \lambda(p, y). \tag{3}$$

A function $L : \mathbf{Z}^* \to \mathbb{R}$ is said to be a superloss process under $\lambda$ if (3) holds
with $\ge$ in place of $=$. If $\lambda$ is computable and mixable, there exists a smallest,
to within an additive constant, upper semicomputable superloss process:

$$\exists L_1 \; \forall L_2 \; \exists c \in \mathbb{R} \; \forall \sigma \in \mathbf{Z}^* : L_1(\sigma) \le L_2(\sigma) + c,$$

where $L_1$ and $L_2$ range over upper semicomputable superloss processes under
$\lambda$. (For a precise statement and proof, see [6], Theorem 1, Lemma 6, and
Corollary 3; [6] only considers the case of a trivial one-element $\mathbf{X}$, but the
extension to the case of general $\mathbf{X}$ is easy.) For each computable mixable $\lambda$
(including the log, Brier, and spherical loss functions), fix such a smallest upper
semicomputable superloss process; it will be denoted $\mathcal{K}^\lambda$, and $\mathcal{K}^\lambda(\sigma)$ will be
called the predictive complexity of $\sigma \in \mathbf{Z}^*$ under $\lambda$. The intuition behind $\mathcal{K}^\lambda(\sigma)$
is that this is the loss of the ideal prediction strategy whose computation is
allowed to take an infinite amount of time.

In this note we consider infinite data sequences $\zeta \in \mathbf{Z}^\infty$, which are
idealizations of long finite data sequences. If $\zeta = (z_1, z_2, \ldots) \in \mathbf{Z}^\infty$ and $T$ is a
nonnegative integer, we let $\zeta^T$ stand for the prefix $z_1 \ldots z_T$ of $\zeta$ of length $T$.

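The randomness deficiency of $\sigma \in \mathbf{Z}^*$ with respect to a prediction algorithm
$F$ under a computable mixable loss function $\lambda$ is defined to be

$$D_F^\lambda(\sigma) := \mathrm{Loss}_F^\lambda(\sigma) - \mathcal{K}^\lambda(\sigma); \tag{4}$$

since $\mathrm{Loss}_F^\lambda$ is upper semicomputable ([6], Section 3.1) and is itself a superloss
process, it dominates $\mathcal{K}^\lambda$ to within an additive constant, and so the function
$D_F^\lambda : \mathbf{Z}^* \to \mathbb{R}$ is bounded below. Notice that the indeterminacy $\infty - \infty$ never arises in (4)
as $\mathcal{K}^\lambda < \infty$. We will sometimes replace the upper index $\lambda$ in any of the three
terms of (4) by “ln” in the case where $\lambda$ is the log loss function.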

Let us say that $\zeta \in \mathbf{Z}^\infty$ is random with respect to $F$ under $\lambda$ if

$$\sup_T D_F^\lambda(\zeta^T) < \infty.$$

The intuition is that in this case $F$ is an optimal prediction algorithm for $\zeta$
under $\lambda$.

Log Randomness 

In the case where $\lambda$ is the log loss function and $\mathbf{X}$ is a one-element set, the
predictive complexity of a finite data sequence $\sigma$ (which is now a binary sequence
if we ignore the uninformative objects) is equal, to within an additive constant,
to $-\ln M(\sigma)$, where $M$ is Levin’s a priori semimeasure. (In terms of this
note, a semimeasure can be defined as a process of the form $e^{-L}$ for some
superloss process $L$ under the log loss function; Levin’s a priori semimeasure is
a largest, to within a constant factor, lower semicomputable semimeasure.) The
randomness deficiency $D_F^{\ln}(\sigma)$ of $\sigma$ with respect to a prediction algorithm $F$ is
then, to within an additive constant, $\ln(M(\sigma)/P(\sigma))$, where $P$ is the probability
measure corresponding to $F$,

$$P(y_1, \ldots, y_T) := p_1 \cdots p_T, \qquad
p_t := \begin{cases}
F(y_1, \ldots, y_{t-1}) & \text{if } y_t = 1 \\
1 - F(y_1, \ldots, y_{t-1}) & \text{if } y_t = 0
\end{cases}$$

(we continue to ignore the objects, which are not informative). Therefore,
$D_F^{\ln}(\sigma)$ is a version of the classical randomness deficiency of $\sigma$, and $\zeta \in \{0,1\}^\infty$
is random with respect to $F$ under the log loss function if and only if $\zeta$ is random
with respect to $P$ in the sense of Martin-Löf.


4 A Simple Statement of Fundamentality 

In this section, we consider computable proper mixable loss functions. 

Theorem 1. Let $\lambda$ be a CPM loss function. If a data sequence $\zeta \in \mathbf{Z}^\infty$ is
random under the log loss function with respect to a prediction algorithm $F$, it
is random under $\lambda$ with respect to $F$.

A special case of this theorem is stated as Proposition 16 in [12]. 

Let us say that a CPM loss function $\lambda$ is fundamental if it can be used in
place of the log loss function in Theorem 1. The proof of the theorem will in
fact demonstrate its following quantitative form: for any computable $\eta > 0$ and
any computable proper $\eta$-mixable $\lambda$ there exists a constant $c_\lambda$ such that, for any
prediction algorithm $F$,

$$D_F^{\ln} \ge \eta D_F^\lambda - c_\lambda. \tag{5}$$

Let us define the mixability constant $\eta_\lambda$ of a loss function $\lambda$ as the supremum
of $\eta$ such that $\lambda$ is $\eta$-mixable. It is known that a mixable loss function $\lambda$ is
$\eta_\lambda$-mixable ([11], Lemmas 10 and 12); therefore, (5) holds for $\eta = \eta_\lambda$, provided
$\eta_\lambda$ is computable.

If X is a one-element set (and so the objects do not play any role and can 
be ignored), the notion of randomness under the log loss function coincides 
with the standard Martin-Löf randomness, as discussed in the previous section.
Theorem 1 shows that other notions of randomness are either equivalent or 
weaker. 

A superprediction is a point in the plane that lies Northeast of the prediction
set (2) (i.e., a point $(x,y) \in \mathbb{R}^2$ such that $\lambda_0(p) \le x$ and $\lambda_1(p) \le y$ for some
$p \in [0,1]$).

Proof of Theorem 1. We will prove (5) for a fixed $\eta \in (0, \infty)$ such that $\eta$ is
computable and $\lambda$ is $\eta$-mixable. Let $L$ be a superloss process under $\lambda$ and
$F$ be a prediction algorithm. Fix temporarily $(\sigma, x) \in \mathbf{Z}^* \times \mathbf{X}$ and set $p :=
F(\sigma, x) \in [0,1]$; notice that $(a, b) := (L(\sigma, x, 0) - L(\sigma), L(\sigma, x, 1) - L(\sigma))$ is
a $\lambda$-superprediction. By the definition of $\eta$-mixability there exists a parallel
translation of the curve $e^{-\eta x} + e^{-\eta y} = 1$ that passes through the point $\lambda_p :=
(\lambda_0(p), \lambda_1(p))$ and lies Southwest of the prediction set of $\lambda$. Let $h$ be the affine
transformation of the plane mapping that translation onto the curve $e^{-x} + e^{-y} = 1$;
notice that $h$ is the composition of the scaling $(x, y) \mapsto \eta(x, y)$ by $\eta$ and then
parallel translation moving the point $\eta\lambda_p$ to the point $(-\ln(1-p), -\ln p)$. The
$\lambda$-superprediction $(a, b)$ is mapped by $h$ to the ln-superprediction

$$\left( \eta a + (-\ln(1-p)) - \eta\lambda_0(p), \; \eta b + (-\ln p) - \eta\lambda_1(p) \right).$$

We can see that $\eta L + \mathrm{Loss}_F^{\ln} - \eta\,\mathrm{Loss}_F^\lambda$ is a superloss process under ln. It is clear
that this ln-superloss process is upper semicomputable if $L$ is. Therefore, for
some constant $c_\lambda$,

$$\mathcal{K}^{\ln} \le \eta\mathcal{K}^\lambda + \mathrm{Loss}_F^{\ln} - \eta\,\mathrm{Loss}_F^\lambda + c_\lambda,$$

which is equivalent to (5). □
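The key geometric step of the proof, that the transformation $h$ sends $\lambda$-superpredictions to ln-superpredictions, can be illustrated numerically. The sketch below is not from the original note. It takes $\lambda$ to be the Brier loss and assumes $\eta = 2$ as its mixability constant, and it uses the fact that $(x, y)$ is an ln-superprediction exactly when $e^{-x} + e^{-y} \le 1$. The names `brier`, `h`, and `is_ln_superprediction` are hypothetical.

```python
import math

ETA = 2.0  # assumed mixability constant of the binary Brier loss

def brier(p):
    """Point (lambda_0(p), lambda_1(p)) of the Brier prediction set."""
    return (p * p, (1.0 - p) ** 2)

def h(point, p, eta=ETA):
    """Scale by eta, then translate so eta * lambda_p lands on (-ln(1-p), -ln p)."""
    a, b = point
    l0, l1 = brier(p)
    return (eta * a - math.log(1.0 - p) - eta * l0,
            eta * b - math.log(p) - eta * l1)

def is_ln_superprediction(point, tol=1e-9):
    """(x, y) is a log-loss superprediction iff exp(-x) + exp(-y) <= 1."""
    x, y = point
    return math.exp(-x) + math.exp(-y) <= 1.0 + tol

# p ranges over the open interval (0, 1); q ranges over the whole prediction set.
p_grid = [i / 100 for i in range(1, 100)]
q_grid = [j / 100 for j in range(0, 101)]
all_mapped = all(
    is_ln_superprediction(h(brier(q), p)) for p in p_grid for q in q_grid
)
```

Since every superprediction lies Northeast of some point of the prediction set, checking the points of the prediction set itself is enough. Rerunning the check with a larger value such as `eta=3.0` produces violations, in line with the Brier loss not being 3-mixable.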


5 A Criterion of Fundamentality 


In this section, we only consider computable proper mixable loss functions that 
satisfy, additionally, the smoothness conditions. The main result of this section 
is the following elaboration of Theorem 1 for CPMS loss functions.

Theorem 2. A CPMS loss function $\lambda$ is fundamental if and only if

$$\inf_p \, (1-p)\lambda_0'(p) > 0. \tag{6}$$

Equivalently, it is fundamental if and only if

$$\inf_p \, (-p)\lambda_1'(p) > 0. \tag{7}$$


We can classify CPMS loss functions $\lambda$ by their degree

$$\deg(\lambda) := \inf\left\{ k : \lambda_0^{(k)}(0) \ne 0 \text{ and } \lambda_1^{(k)}(1) \ne 0 \right\},$$

where $^{(k)}$ stands for the $k$th derivative and, as usual, $\inf\emptyset := \infty$. We will see
later in this section that Theorem 2 can be restated to say that the fundamental
loss functions are exactly those of degree 1. Furthermore, we will see that for a
CPMS loss function $\lambda$ of degree $1 < k < \infty$ there exist a data sequence $\zeta \in \mathbf{Z}^\infty$
and a prediction algorithm $F$ such that $\zeta$ is random with respect to $F$ under $\lambda$
while the randomness deficiency of $\zeta^T$ with respect to $F$ under the log
loss function grows almost as fast as $T^{1-1/k}$ as $T \to \infty$.

Straightforward calculations show that the log loss function has degree 1 and
the Brier and spherical loss functions have degree 2.

In the proof of Theorem 2 we will need the notion of the signed curvature of
the prediction curve $(\lambda_0(p), \lambda_1(p))$ at a point $p \in (0,1)$, which can be defined as

$$k_\lambda(p) := \frac{\lambda_0'(p)\lambda_1''(p) - \lambda_1'(p)\lambda_0''(p)}{\left(\lambda_0'(p)^2 + \lambda_1'(p)^2\right)^{3/2}}. \tag{8}$$

The mixability constant $\eta_\lambda$ (i.e., the largest $\eta$ for which $\lambda$ is $\eta$-mixable) is

$$\eta_\lambda = \inf_p \frac{k_\lambda(p)}{k_{\ln}(p)}.$$

Therefore, $\lambda$ is mixable if and only if

$$\inf_p \frac{k_\lambda(p)}{k_{\ln}(p)} > 0. \tag{9}$$

Lemma 1. A CPMS loss function $\lambda$ is fundamental if and only if

$$\sup_p \frac{k_\lambda(p)}{k_{\ln}(p)} < \infty$$

(cf. (9)).


The proof of the part “if” of Lemma 1 goes along the same lines as the proof
of Theorem 1, and also shows that, if $\lambda$ and $\Lambda$ are CPMS loss functions such
that

$$\eta_\lambda := \inf_p \frac{k_\lambda(p)}{k_{\ln}(p)} > 0
\quad\text{and}\quad
H_\Lambda := \sup_p \frac{k_\Lambda(p)}{k_{\ln}(p)} < \infty$$

are computable numbers, then there exists $c_{\lambda,\Lambda} \in \mathbb{R}$ such that, for any prediction
algorithm $F$,

$$H_\Lambda D_F^\Lambda \ge \eta_\lambda D_F^\lambda - c_{\lambda,\Lambda}.$$

We will call $H_\Lambda$ the fundamentality constant of $\Lambda$ (analogously to $\eta_\lambda$ being called
the mixability constant of $\lambda$).

Notice that the log loss function (perhaps scaled by multiplying by a positive
constant) is the only loss function for which the mixability and fundamentality
constants coincide, $\eta_{\ln} = H_{\ln}$. Therefore, fundamental CPMS loss functions can
be regarded as log-loss-like.

The part “only if” of Lemma 1 will be proved below, in the proof of
Theorem 2.

The computation of $k_\lambda$ for the three basic loss functions using (8) gives:

• For the log loss function, the result is

$$k_{\ln}(p) = \frac{p(1-p)}{\left(p^2 + (1-p)^2\right)^{3/2}}. \tag{10}$$

• For the Brier loss function, the result is

$$k_{\mathrm{Brier}}(p) = \frac{1}{2\left(p^2 + (1-p)^2\right)^{3/2}}.$$

• For the spherical loss function, the result is

$$k_{\mathrm{spher}}(p) = 1.$$
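These closed forms can be cross-checked by evaluating the signed curvature formula (8) with finite differences. The following is a rough numerical sketch under stated assumptions: the derivatives are approximated by central differences with a hypothetical step `h=1e-4`, so agreement is only expected up to a small tolerance, and all function names are illustrative.

```python
import math

def d1(f, p, h=1e-4):
    """Central-difference approximation of f'(p)."""
    return (f(p + h) - f(p - h)) / (2 * h)

def d2(f, p, h=1e-4):
    """Central-difference approximation of f''(p)."""
    return (f(p + h) - 2 * f(p) + f(p - h)) / (h * h)

def signed_curvature(lam0, lam1, p):
    """Formula (8) for the curve (lam0(p), lam1(p)), derivatives taken numerically."""
    a1, b1 = d1(lam0, p), d1(lam1, p)
    a2, b2 = d2(lam0, p), d2(lam1, p)
    return (a1 * b2 - b1 * a2) / (a1 * a1 + b1 * b1) ** 1.5

def r(p):
    """sqrt(p^2 + (1 - p)^2), which appears in all three closed forms."""
    return math.sqrt(p * p + (1 - p) ** 2)

# (lambda_0, lambda_1) for each of the three loss functions
LOSSES = {
    "log": (lambda p: -math.log(1 - p), lambda p: -math.log(p)),
    "brier": (lambda p: p * p, lambda p: (1 - p) ** 2),
    "spherical": (lambda p: 1 - (1 - p) / r(p), lambda p: 1 - p / r(p)),
}

# Closed forms stated in the text
CLOSED_FORMS = {
    "log": lambda p: p * (1 - p) / r(p) ** 3,
    "brier": lambda p: 0.5 / r(p) ** 3,
    "spherical": lambda p: 1.0,
}
```

For example, `signed_curvature(*LOSSES["brier"], 0.5)` can be compared with `CLOSED_FORMS["brier"](0.5)`.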
We can plug the expression (10) for the signed curvature of the log loss
function into Lemma 1 to obtain a more explicit statement. Because of the propriety
of $\lambda$, this statement can be simplified, which gives the following corollary.

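Corollary 1. A CPMS loss function $\lambda$ is fundamental if and only if

$$\sup_p \frac{\lambda_0'(p)\lambda_1''(p) - \lambda_1'(p)\lambda_0''(p)}{\lambda_0'(p)\lambda_1'(p)\left(\lambda_1'(p) - \lambda_0'(p)\right)} < \infty. \tag{11}$$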


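Proof. In view of the expressions (8) and (10), the condition in Lemma 1 can
be written as

$$\sup_p \frac{\lambda_0'(p)\lambda_1''(p) - \lambda_1'(p)\lambda_0''(p)}{\left(\lambda_0'(p)^2 + \lambda_1'(p)^2\right)^{3/2}} \cdot \frac{\left(p^2 + (1-p)^2\right)^{3/2}}{p(1-p)} < \infty.$$

Therefore, it suffices to check that

$$\frac{\left(\lambda_0'(p)^2 + \lambda_1'(p)^2\right)^{3/2}}{\lambda_0'(p)\lambda_1'(p)\left(\lambda_1'(p) - \lambda_0'(p)\right)} = \frac{\left(p^2 + (1-p)^2\right)^{3/2}}{p(1-p)}.$$

The last equality follows from

$$\frac{\lambda_1'(p)}{\lambda_0'(p)} = \frac{p-1}{p}, \tag{12}$$

which in turn follows from the propriety of $\lambda$. □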
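The step from propriety to (12) can be spelled out as follows. This is a sketch that assumes $p \in (0,1)$ and relies on the smoothness conditions of Section 2, so that the expected loss can be differentiated in the prediction $q$ at $q = p$.

```latex
\begin{align*}
  \mathbb{E}_p \lambda(q,\cdot)
    &= p\,\lambda_1(q) + (1-p)\,\lambda_0(q)
    && \text{(definition of } \mathbb{E}_p \text{)} \\
  \left.\frac{\partial}{\partial q}\,\mathbb{E}_p \lambda(q,\cdot)\right|_{q=p}
    &= p\,\lambda_1'(p) + (1-p)\,\lambda_0'(p) = 0
    && \text{(by (1), } q = p \text{ minimizes the expected loss)} \\
  \frac{\lambda_1'(p)}{\lambda_0'(p)}
    &= -\frac{1-p}{p} = \frac{p-1}{p}
    && \text{(dividing by } p\,\lambda_0'(p) \ne 0 \text{)}
\end{align*}
```

Here $\lambda_0'(p) \ne 0$ because otherwise the first-order condition would force $\lambda_1'(p) = 0$ as well, contradicting the third smoothness condition.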


It is instructive to compare the criterion (11) with the well-known criterion

$$\inf_p \frac{\lambda_0'(p)\lambda_1''(p) - \lambda_1'(p)\lambda_0''(p)}{\lambda_0'(p)\lambda_1'(p)\left(\lambda_1'(p) - \lambda_0'(p)\right)} > 0 \tag{13}$$

for $\lambda$ being mixable (see, e.g., [5] or [7], Theorem 2; it goes back to [10],
Lemma 1). The criterion (13) can be derived from (9) as in the proof of
Corollary 1.

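Proof of Theorem 2. Differentiating (12) we obtain

$$\frac{\lambda_1''(p)\lambda_0'(p) - \lambda_1'(p)\lambda_0''(p)}{\lambda_0'(p)^2} = p^{-2},$$

and the fundamentality constant (11) of $\lambda$ is

$$\sup_p \frac{p^{-2}\lambda_0'(p)^2}{\lambda_0'(p)\lambda_1'(p)\left(\lambda_1'(p) - \lambda_0'(p)\right)}
= \sup_p \frac{p^{-2}}{\lambda_0'(p)\left(\lambda_1'(p)/\lambda_0'(p)\right)\left(\lambda_1'(p)/\lambda_0'(p) - 1\right)}
= \sup_p \frac{p^{-2}}{\lambda_0'(p)(1 - 1/p)(-1/p)}
= \sup_p \frac{1}{\lambda_0'(p)(1-p)},$$

where we have used (12). This gives us (6); in combination with (12) we get (7).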



Let us now prove the part “only if” of Theorem 2 (partly following the
argument given after Proposition 16 of [12]). According to (12), (6) and (7) are
equivalent. Suppose that

$$\inf_p \, (1-p)\lambda_0'(p) = 0,$$

and let us check that $\lambda$ is not fundamental. By the smoothness assumptions, we
have $(1-p)\lambda_0'(p) = 0$ either for $p = 0$ or for $p = 1$. Suppose, for concreteness,
that $(1-p)\lambda_0'(p) = 0$ for $p = 0$ (if $(1-p)\lambda_0'(p) = 0$ for $p = 1$, we will have
$(-p)\lambda_1'(p) = 0$ for $p = 1$, and we can apply the same argument as below for $p = 1$
in place of $p = 0$). Let $k$ be such that $\lambda_0^{(k)}(0) > 0$ but $\lambda_0^{(i)}(0) = 0$ for all $i < k$; we
know that $k \ge 2$ (the easy case where $\lambda_0^{(i)}(0) = 0$ for all $i$ should be considered
separately). Consider any data sequence $\zeta = (x_1, y_1, x_2, y_2, \ldots) \in \mathbf{Z}^\infty$ in which
all labels are 0: $y_1 = y_2 = \cdots = 0$. We then have $\sup_T \mathcal{K}^{\ln}(\zeta^T) < \infty$ and
$\sup_T \mathcal{K}^\lambda(\zeta^T) < \infty$. Let $F$ be the prediction algorithm that outputs $p_t := t^{-1/k-\epsilon}$
at step $t$, where $\epsilon \in (0, 1 - 1/k)$. Then $\zeta$ is random with respect to $F$ under $\lambda$
since the loss of this prediction algorithm over the first $T$ steps is

$$\sum_{t=1}^{T} \lambda_0(p_t) \le 2\,\frac{\lambda_0^{(k)}(0)}{k!} \sum_{t=1}^{T} p_t^k + O(1)$$

(we have used Taylor’s approximation for $\lambda_0$) and the series $\sum_t p_t^k$ is convergent.
On the other hand, the randomness deficiency of $\zeta^T$ with respect to $F$ under
the log loss function grows as

$$-\sum_{t=1}^{T} \ln(1 - p_t) \sim \sum_{t=1}^{T} p_t \sim \frac{k}{k - 1 - k\epsilon}\, T^{1 - 1/k - \epsilon}.$$

□
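The construction in this proof can be illustrated numerically for the Brier loss, which has degree $k = 2$. The sketch below is illustrative only. It assumes $\epsilon = 0.25$ and shifts the index by one, using $p_t = (t+1)^{-1/k-\epsilon}$, so that every prediction is strictly below 1 and the log loss stays finite; this shift changes the sums only by a bounded amount. On the all-zero sequence both predictive complexities are bounded, so the cumulative losses track the randomness deficiencies up to an additive constant.

```python
import math

K = 2        # degree of the Brier loss
EPS = 0.25   # assumed value; any number in (0, 1 - 1/K) would do

def prediction(t):
    """p_t = (t + 1) ** (-1/K - EPS); the shift by one is an assumption of this sketch."""
    return (t + 1) ** (-1.0 / K - EPS)

def cumulative_losses(T):
    """Cumulative Brier and log losses on T labels that are all 0."""
    brier = 0.0
    log = 0.0
    for t in range(1, T + 1):
        p = prediction(t)
        brier += p * p              # lambda_0(p) = p^2 for the Brier loss
        log += -math.log1p(-p)      # lambda_0(p) = -ln(1 - p) for the log loss
    return brier, log

brier_small, log_small = cumulative_losses(10_000)
brier_large, log_large = cumulative_losses(100_000)
```

Going from $T = 10^4$ to $T = 10^5$, the Brier total barely moves (the series $\sum_t p_t^2$ converges), while the log loss total keeps growing roughly like $T^{1/4}$.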


Notice that the criterion of mixability (13) can be simplified when we
use (12): it becomes

$$\sup_p \, (1-p)\lambda_0'(p) < \infty$$

or, equivalently,

$$\sup_p \, (-p)\lambda_1'(p) < \infty.$$

The function $(1-p)\lambda_0'(p) = (-p)\lambda_1'(p)$ can be computed as

• $1$ in the case of the log loss function;

• $2p(1-p)$ in the case of the Brier loss function;

• $p(1-p)\left(p^2 + (1-p)^2\right)^{-3/2}$ in the case of the spherical loss function.

Therefore, all three loss functions are mixable, but only the log loss function is
fundamental.
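The two criteria can be compared side by side by evaluating $(1-p)\lambda_0'(p)$ on a fine grid. This is a minimal sketch: the derivatives $\lambda_0'$ are entered in closed form (worked out by hand for this illustration), and the grid minimum and maximum only approximate the infimum and supremum over $(0,1)$.

```python
import math

def r(p):
    """sqrt(p^2 + (1 - p)^2)."""
    return math.sqrt(p * p + (1 - p) ** 2)

# Closed-form derivatives lambda_0'(p) of the three loss functions
D_LAMBDA0 = {
    "log": lambda p: 1.0 / (1.0 - p),
    "brier": lambda p: 2.0 * p,
    "spherical": lambda p: p / r(p) ** 3,
}

def g(name, p):
    """The function (1 - p) * lambda_0'(p) from criteria (6) and (13)."""
    return (1.0 - p) * D_LAMBDA0[name](p)

def min_max_on_grid(name, n=10_000):
    """Grid approximation of (inf, sup) of g over the open interval (0, 1)."""
    values = [g(name, i / n) for i in range(1, n)]
    return min(values), max(values)

SUMMARY = {name: min_max_on_grid(name) for name in D_LAMBDA0}
```

For all three losses the grid maximum is finite, matching mixability. The grid minimum stays at 1 for the log loss but approaches 0 near the endpoints for the Brier and spherical losses, matching the failure of (6).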

It is common in experimental machine learning to truncate allowed
probabilistic predictions to the interval $[\epsilon, 1-\epsilon]$ for a small constant $\epsilon > 0$ (this boils
down to cutting off the ends of the prediction sets corresponding to the slopes
below $\epsilon$ and above $1-\epsilon$). It is easy to check that in this case all CPMS loss
functions lead to the same notion of randomness.

Corollary 2. CPMS loss functions $\lambda$ and $\Lambda$ restricted to $p \in [\epsilon, 1-\epsilon]$, where
$\epsilon > 0$, lead to the same notion of randomness.

We can make the corollary more precise as follows: for prediction algorithms
$F$ restricted to $[\epsilon, 1-\epsilon]$, $D_F^\lambda$ and $D_F^\Lambda$ coincide to within a factor of

$$\max\left( \sup_{p \in [\epsilon, 1-\epsilon]} \frac{k_\lambda(p)}{k_\Lambda(p)}, \; \sup_{p \in [\epsilon, 1-\epsilon]} \frac{k_\Lambda(p)}{k_\lambda(p)} \right)$$

and an additive constant.
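As an illustration of this factor, it can be evaluated for the log and Brier loss functions from their signed curvatures. The sketch below is an assumption-laden example rather than part of the note: `coincidence_factor` is a hypothetical helper that approximates the two suprema on a finite grid including both endpoints of the truncated interval.

```python
def r3(p):
    """(p^2 + (1 - p)^2) ** (3/2), the common denominator of the curvatures."""
    return (p * p + (1 - p) ** 2) ** 1.5

def k_log(p):
    """Signed curvature of the log loss prediction curve."""
    return p * (1 - p) / r3(p)

def k_brier(p):
    """Signed curvature of the Brier loss prediction curve."""
    return 0.5 / r3(p)

def coincidence_factor(k_a, k_b, eps, n=1000):
    """Grid approximation of max(sup k_a/k_b, sup k_b/k_a) over [eps, 1 - eps]."""
    grid = [eps + (1 - 2 * eps) * i / n for i in range(n + 1)]
    return max(
        max(k_a(p) / k_b(p) for p in grid),
        max(k_b(p) / k_a(p) for p in grid),
    )

factor = coincidence_factor(k_log, k_brier, eps=0.01)
```

For this pair the ratio of curvatures is $2p(1-p)$ or its reciprocal, so the factor equals $1/(2\epsilon(1-\epsilon))$, which is about 50.5 for $\epsilon = 0.01$ and grows without bound as $\epsilon \to 0$.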


6 Frequently Asked Questions 

This section is more discursive than the previous ones; “frequently” in its title 
means “at least once” (but with a reasonable expectation that a typical reader 
might well ask similar questions). 

What is the role of the requirement of propriety in Theorem 1?

The theorem says that the log loss function leads to the most restrictive notion
of randomness: if a sequence is random with respect to some prediction
algorithm under the log loss function, then it is random with respect to the “same”
prediction algorithm under an arbitrary CPM loss function. One should
explain, however, what is meant by the same prediction algorithm, because of the
freedom in parameterization (say, we can replace each prediction $p$ by $p^2$). The
requirement of propriety imposes a canonical parameterization.

What is the role of the requirement of mixability in Theorem 1?

The requirement of mixability ensures the existence of predictive complexity, 
which is used in the definition of predictive randomness. 

Mixability is sufficient for the existence of predictive complexity (for computable 
loss functions). Is it also necessary? 

Yes, it is: see Theorem 1 in [8]. 

What is the geometric intuition behind the notions of propriety and mixability? 

The intuitions behind the two notions overlap; both involve requirements of
convexity of the “superprediction set” (the area Northeast of the prediction set
(2)). Let us suppose that the loss function $\lambda$ is continuous in the prediction $p$, so
that the prediction set is a curve. Propriety then means that the superprediction
set is strictly convex (in particular, the prediction set has no straight segments)
and that the points on the prediction set are indexed in a canonical way (namely,
each such point is indexed by $1/(1-s)$ where $s < 0$ is the slope of the tangent
line to the prediction set at that point: cf. (12)). Mixability means that the
superprediction set is convex in a stronger sense: it stays convex after being
transformed by the mapping $(x,y) \in [0,\infty]^2 \mapsto (e^{-\eta x}, e^{-\eta y})$ for some $\eta > 0$.

Why should we consider not only the log loss function (which nicely corresponds
to probability distributions) but also other loss functions? You say “the log loss
function, being most selective, should be preferred to the alternatives such as 
Brier or spherical loss”. But this does not explain why these other loss functions 
were interesting in the first place. 

Loss functions different from the log loss function are widely used in practice; 
in particular, the Brier loss function is at least as popular as (and perhaps even 
more popular than) the log loss function in machine learning: see, e.g., the 
extensive empirical study [2]. An important reason for the popularity of Brier 
loss is that the log loss function often leads to infinite average losses on large 
test sets for state-of-the-art prediction algorithms, which is considered to be 
“unfair”, and some researchers even believe that any reasonable loss function 
should be bounded. 


7 Conclusion 

This note offers an answer to the problem of choosing a loss function for
evaluating probabilistic prediction algorithms in experimental machine learning. Our
answer is that the log loss function, being most selective, should be preferred to 
the alternatives such as Brier or spherical loss. 